Learning to Separate Object Sounds by Watching Unlabeled Video

Authors

Ruohan Gao

Rogerio Feris

Kristen Grauman

Abstract

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to study audio source separation in large-scale general “in the wild” videos. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising.
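The abstract does not say how the frequency bases are computed or applied, so the following is only a minimal sketch of the general idea of basis-guided separation. It assumes non-negative bases over a magnitude spectrogram and a standard multiplicative-update solver; the function names `estimate_activations` and `separate`, and the toy data, are hypothetical and not taken from the paper.

```python
import numpy as np

def estimate_activations(V, W, n_iter=200, eps=1e-9, seed=0):
    """Keep the bases W (freq x k) fixed and fit non-negative activations
    H (k x time) so that W @ H approximates the spectrogram V.
    Uses multiplicative updates for the squared-error objective."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def separate(V, bases_per_object, eps=1e-9):
    """Split a mixture spectrogram V into one spectrogram per object.
    bases_per_object maps an object name to its (freq x k_j) bases
    (assumed to be already known for each object)."""
    names = list(bases_per_object)
    W = np.concatenate([bases_per_object[n] for n in names], axis=1)
    H = estimate_activations(V, W)
    total = W @ H + eps
    sources, start = {}, 0
    for n in names:
        k = bases_per_object[n].shape[1]
        part = bases_per_object[n] @ H[start:start + k]
        sources[n] = V * (part / total)  # soft mask applied to the mixture
        start += k
    return sources

# Toy example: two objects occupying disjoint frequency bands.
rng = np.random.default_rng(1)
W_a = np.zeros((6, 1)); W_a[:3, 0] = 1.0
W_b = np.zeros((6, 1)); W_b[3:, 0] = 1.0
V_a = W_a @ rng.random((1, 8))
V_b = W_b @ rng.random((1, 8))
V_mix = V_a + V_b
sources = separate(V_mix, {"a": W_a, "b": W_b})
```

The soft mask keeps the per-object estimates summing to the mixture, which is a common design choice in this kind of separation; real bases overlap in frequency, so the split would be far less clean than in this toy case.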

Related resources

Time-Contrastive Networks: Self-Supervised Learning from Video

We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that ca...

Anticipating the future by watching unlabeled video

In many computer vision applications, machines will need to reason beyond the present, and predict the future. This task is challenging because it requires leveraging extensive commonsense knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently obtaining this knowledge is through the massive amounts of readily available unlabeled video. In th...

The Role of Avatar in Interactive Fictional World of Video Games

In third-person video games, players are able to move and progress in the interactive world of the game while watching their avatar from an external point of view. The purpose of this paper is to investigate the role of avatar in the interactive imaginary world of video games using double vision theory. This article is based on descriptive-analytical methods and the use of library data and imag...

Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video

We explore object discovery and detector adaptation based on unlabeled video sequences captured from a mobile platform. We propose a fully automatic approach for object mining from video which builds upon a generic object tracking approach. By applying this method to three large video datasets from autonomous driving and mobile robotics scenarios, we demonstrate its robustness and generality. B...

Visualizing Video Sounds through Sound Word Animation

Sound information in video plays an important role in constructing audience experience. On the other hand, there are many circumstances where the audience cannot watch video with sounds. Subscripts are conventionally used as visual aids to provide the missing sound information. However, conventional subscripts are far less expressive for non-verbal sounds since it is designed to visualize speec...

Publication date: 2018
